Facile and highly precise pH-value estimation using common pH paper based on machine learning techniques and supported mobile devices

Numerous scientific, health care, and industrial applications show increasing interest in low-cost, high-precision optical pH sensors that cover a wide pH range. Despite serious efforts, combining high accuracy with cost-effectiveness remains challenging. Here, we apply machine learning to common pH paper for precise pH-value estimation. Further, we develop a simple, flexible, and free mobile application based on a machine learning algorithm that predicts the pH value of a solution using commercially available pH paper. Common lighting conditions were studied at light intensities of 350, 200, and 20 Lux. The models were trained using 2689 experimental values without special instrument control. The pH range of 1 to 14 is covered at intervals of ~0.1 pH. The results show a significant relationship between the pH value and both the red and green colors, in contrast to the poor correlation with the blue color. The K Neighbors Regressor model improves linearity and achieves a coefficient of determination of 0.995 combined with the lowest errors. The free, publicly accessible online and mobile application enables highly precise estimation of the pH value as a function of the RGB color code of typical pH paper. Our approach could replace more expensive pH instruments with handheld pH detection through an intelligent smartphone system usable by everyone, even a chef in the kitchen, without the need for additional costly and time-consuming experimental work.


Machine learning algorithms.
Regression is a technique for predicting continuous pH values by learning the relationship between the actual and predicted pH values. Eleven supervised machine learning regression models were applied to the collected data to choose the model that best fits the problem: Linear Regression (LR), Decision Tree Regressor (DTR), Random Forest Regressor (RTR), K Neighbors Regressor (KNNR), Support Vector Regression (SVR), Lasso Regression (L1), Ridge Regression (L2), Elastic Net Regressor (ENR), AdaBoost Regression (ABR), Gradient Boosting Regressor (GBR), and Artificial Neural Network Regressor (ANNR). All models are available in the Scikit-learn package [21]. In addition, the exploratory data analysis visualizations and heatmap figures were created using the seaborn package in Python [22].
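The comparison of the eleven regressors can be sketched as follows. This is a minimal illustration, not the authors' code: the data are synthetic stand-ins, every model uses scikit-learn defaults (an assumption, since the tuned hyperparameters are not listed here), and the dictionary keys simply reuse the abbreviations from the text.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the paper's measurements): random RGB triples
# scaled to [0, 1] and a made-up target that is linear in red and green.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(300, 3)) / 255.0
y = 14.0 * (0.7 * (1.0 - X[:, 0]) + 0.3 * (1.0 - X[:, 1]))

# One entry per model named in the text; hyperparameters are library defaults.
models = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(random_state=0),
    "RTR": RandomForestRegressor(random_state=0),
    "KNNR": KNeighborsRegressor(),
    "SVR": SVR(),
    "L1": Lasso(),
    "L2": Ridge(),
    "ENR": ElasticNet(),
    "ABR": AdaBoostRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "ANNR": MLPRegressor(max_iter=2000, random_state=0),
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out split

# Print the models from best to worst held-out R^2.
for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>5s}  R2 = {r2:.3f}")
```

Because the synthetic target is linear, the ranking printed here says nothing about the ranking on real pH-paper data; the sketch only shows how a common train/test split lets all candidates be scored the same way.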
Metrics for regression. Several metrics were used to evaluate the regression models: the coefficient of determination (R²), Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE). They can be calculated with the Scikit-learn metrics module according to Eqs. (1)-(4) [23,24], where N is the number of recorded samples, y_i is the predicted pH value, and ŷ_i is the actual pH value.
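For reference, the standard textbook definitions of these four metrics, which Eqs. (1)-(4) are assumed to follow, can be written with the symbols defined above (y_i predicted, ŷ_i actual). The symbol for the mean of the actual values, \bar{\hat{y}}, is introduced here for illustration.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2}
                {\sum_{i=1}^{N} \left(\hat{y}_i - \bar{\hat{y}}\right)^2},
\qquad \bar{\hat{y}} = \frac{1}{N}\sum_{i=1}^{N} \hat{y}_i \\
\mathrm{MSE} &= \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2 \\
\mathrm{MAE} &= \frac{1}{N}\sum_{i=1}^{N} \left|\,y_i - \hat{y}_i\,\right| \\
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}
              = \sqrt{\mathrm{MSE}}
\end{align}
```

MSE, MAE, and RMSE are symmetric in the predicted and actual values, so only R² depends on which of the two supplies the mean in the denominator.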
Automated color-information-extraction from the captured images. To extract the RGB color code from each image, we used Python 3.7 code based on the OpenCV package [25]. We noted a small deviation of the RGB values at various positions within one image. Thus, the RGB values were sampled at seven distinct (X, Y) positions (10,10; 15,15; 20,20; 25,25; 30,30; 35,35; 40,40) to cover the whole image, as illustrated in Fig. 1.
pH discrimination with a machine learning model. We used a KNN regression model to study the 2689 collected samples using Python 3.7 and the scikit-learn package [26,27]. We randomly split the data into training data (70%, i.e., 1880 samples) and testing data (30%, i.e., 808 samples). The testing data were completely excluded from the model training phase. In machine learning, hyperparameters are parameters that the user sets explicitly to control the learning process and improve the model. We therefore trained the models with a series of integer values of K (1, 2, 3, …) and found the optimal hyperparameter (highest coefficient of determination and lowest errors) at K = 5.
Figure 2 presents a collection of 130 captures of the experimentally observed color change of the pH paper (at 350 Lux) over the pH range 0-14 at intervals of ~0.1 pH. It is worth mentioning that the traditional estimation based on the color change of pH paper is accompanied by a significant variance in pH value (~2). This high variance leads to noticeably wrong estimates when the color is judged by eye. This finding encouraged us to develop a new, simple, and more precise method for pH-value detection. Thus, the experiments were extended to cover three different workplace illumination levels that a user is likely to encounter: 350, 200, and 20 Lux.
Moreover, the homogeneity of the color of the pH paper was accounted for by collecting the RGB color code at seven distinct positions per capture. In total, the data set includes 2689 experimental RGB values from the different illumination workplaces.
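Sampling the RGB code at the seven positions listed above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' script: the function names are hypothetical, the image is assumed to be a cropped patch at least 41 pixels on each side, and the (X, Y) pairs are assumed to be pixel coordinates with X as the column and Y as the row.

```python
import numpy as np

# The seven (x, y) pixel positions quoted in the text.
POSITIONS = [(10, 10), (15, 15), (20, 20), (25, 25), (30, 30), (35, 35), (40, 40)]


def sample_rgb(image_bgr, positions=POSITIONS):
    """Return one (R, G, B) tuple per position from a BGR image array."""
    samples = []
    for x, y in positions:
        # NumPy indexing is [row, column], i.e. [y, x]; OpenCV stores B, G, R.
        b, g, r = image_bgr[y, x]
        samples.append((int(r), int(g), int(b)))
    return samples


def load_and_sample(path):
    """Read an image file with OpenCV and sample it at the seven positions."""
    import cv2  # imported lazily so sample_rgb itself needs only NumPy

    image_bgr = cv2.imread(path)  # returns None if the file cannot be read
    if image_bgr is None:
        raise FileNotFoundError(path)
    return sample_rgb(image_bgr)


# Demo on a synthetic, uniformly coloured 50 x 50 patch (B=30, G=120, R=200).
patch = np.zeros((50, 50, 3), dtype=np.uint8)
patch[:, :] = (30, 120, 200)
print(sample_rgb(patch))
```

Each capture then contributes seven rows to the data set, one per sampled position, which is how small colour variations within a single strip enter the training data.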

Results and discussion
To better understand the results observed at the different workplaces, an Exploratory Data Analysis (EDA) of the RGB color code against pH value at light intensities of 20, 200, and 350 Lux is illustrated in Fig. 3.
The color code points fall into three regions across the wide pH range. The significant changes in the red, green, and even blue color codes occur in the pH range 2.5 to 9 at all three investigated light intensities (20, 200, and 350 Lux). It is worth mentioning that the blue color code at the low light intensity of 20 Lux (a somewhat dark workplace) deviates from that obtained at medium or high light intensity, which suggests avoiding testing in low-light conditions. In contrast, the red and green colors showed no significant difference in behavior across the light intensities. The results also show that at higher basicity (pH > 9) or higher acidity (pH < 2.5) the color changes little, which may produce less accurate predictions in those parts of the pH range. Thus, this finding may encourage the scientific community to prepare more sensitive materials that work in strongly acidic and/or strongly basic media.
Furthermore, it is critical to recognize and evaluate how dependent each parameter is on the others. This knowledge can help define what to expect from these interdependencies, leading to more effective pH devices and color-sensitive materials. For this reason, the statistical Pearson's correlation coefficients (r_x,y) between the pH parameters were investigated based on Eqs. (5) and (6). The correlation between the pH parameters is presented as a heatmap in Fig. 4. The results show a strong negative correlation between the pH value and the red color (−0.77) and an acceptable correlation between the pH value and the green color (−0.38). The blue color showed a very low correlation with the pH value (0.044) compared with the red and green colors. This means that the blue color will have a small effect on the machine learning prediction compared to the red and green colors. In the same way, the workplace illumination shows no significant correlation with the pH value (−0.03). Thus, the colored pH paper can be safely captured whatever the light intensity.
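As a reminder of how such coefficients are obtained, the sketch below implements the standard Pearson formula, r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²), using only the Python standard library. The function name and the demonstration values are made up for illustration and are not taken from the data set.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustration values (NOT measurements): a red channel that falls
# as pH rises gives a clearly negative coefficient.
ph_values = [1, 3, 5, 7, 9, 11, 13]
red_values = [250, 240, 200, 150, 90, 60, 40]
print(f"r(pH, red) = {pearson_r(ph_values, red_values):.3f}")
```

Applying the same function to every pair of columns (pH, red, green, blue, light intensity) yields the symmetric matrix that a heatmap such as Fig. 4 displays.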

ML model prediction.
Using the experimental data, a preliminary analysis of machine learning regression techniques was performed with optimal hyperparameters for the K-Nearest Neighbors (KNN), Linear, Lasso, Elastic Net, AdaBoost, Neural Network, Random Forest, Support Vector Machine (SVM), and Gradient Boosting Regressor algorithms [28][29][30]. The coefficients of determination (R²) and the errors of the corresponding regression evaluation metrics, namely root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE), are shown in Fig. 5 and recorded in Table 1.
The KNN model with the optimal hyperparameter of five neighbors achieves an R² of 0.993 combined with the lowest MSE, RMSE, and MAE (0.012, 0.320, and 0.182, respectively) compared with the other models. In addition, the coefficient of variation of the root mean square error (CVRMSE) of the KNN model, 4.077, indicates more stable performance than the other models. Further, cross-validation with K-fold values of 3, 5, 10, and 20 was tested to confirm the stability of the models. No significant difference was found between the results, which supports the choice of the KNN model.
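The selection of K and the K-fold stability check described above can be sketched as follows. This is an illustrative outline, not the authors' pipeline: the data are synthetic, the sweep range of 1 to 10 is an assumption, and CVRMSE is taken here as RMSE divided by the mean observed value, in percent, which is a common definition that the text does not spell out.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (NOT the paper's measurements): a made-up smooth
# colour response to pH plus noise, used only to make the sketch runnable.
rng = np.random.default_rng(42)
ph = rng.uniform(0.0, 14.0, size=600)
red = 255.0 / (1.0 + np.exp(0.8 * (ph - 6.0)))
green = 80.0 + 120.0 * np.exp(-((ph - 5.0) ** 2) / 8.0)
blue = 60.0 + 10.0 * np.sin(ph)
X = np.column_stack([red, green, blue]) + rng.normal(0.0, 2.0, size=(600, 3))
y = ph

# 70/30 split; the test part is never seen during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def evaluate(k):
    """Fit a KNN regressor with k neighbours and score it on the held-out split."""
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rmse = float(np.sqrt(mse))
    return {
        "k": k,
        "r2": r2_score(y_test, pred),
        "mse": mse,
        "mae": mean_absolute_error(y_test, pred),
        "rmse": rmse,
        # Assumed definition: RMSE relative to the mean observed value, in percent.
        "cvrmse_percent": 100.0 * rmse / float(np.mean(y_test)),
    }


# Sweep K over an assumed range and keep the value with the lowest RMSE.
results = [evaluate(k) for k in range(1, 11)]
best = min(results, key=lambda r: r["rmse"])
print("best K by RMSE:", best["k"])

# Repeat the evaluation with several fold counts to check stability.
cv_r2 = {}
for n_splits in (3, 5, 10, 20):
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    fold_scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=best["k"]), X, y, cv=folds, scoring="r2"
    )
    cv_r2[n_splits] = float(fold_scores.mean())
    print(f"{n_splits:>2d}-fold mean R2 = {cv_r2[n_splits]:.3f}")
```

If the mean R² barely moves between 3, 5, 10, and 20 folds, the model's performance does not depend on one particular split, which is the sense of stability used in the text.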
For further insight, the model predictions (on the test data) are plotted against the experimentally obtained pH values in the scatter plots in Fig. 6. The linear regression, elastic net, and neural network algorithms could not reproduce all of the experimental points, especially in the strong acid/base pH range. In contrast, the predictions of the KNN, gradient boosting, random forest, and AdaBoost algorithms lie along the diagonal of the plot, so any of these could be selected for deployment. Although all four algorithms perform well with exceedingly small deviations, KNN was chosen for the machine learning mobile application because it has the lowest error (RMSE: 0.32) and the highest stability (CVRMSE: 4.08).
The KNN model can thus successfully capture the underlying patterns that link the RGB color code to the pH value in the experimental data. The machine learning approach based on this model was therefore expanded into a versatile platform able to predict the pH value with high accuracy using common pH paper. The online mobile application of the prediction model was developed using Python code and Streamlit Cloud (freely available) and permits highly precise determination of the pH value as a function of the RGB color code of common pH paper. As illustrated in Fig. 7, the mobile application involves three steps. The first is the input step, in which the user provides a capture of the pH paper taken immediately after it has been immersed in the target solution. For convenience, we have coded three options (upload a picture, use the mobile camera, or enter an RGB color code). This step is followed by a built-in machine learning process (without control from the user). Finally, the output pH value appears on the screen.
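The three-step flow (input, built-in prediction, output) can be outlined without any user-interface code. The sketch below is hypothetical throughout: the function names, the idea of averaging the picture to a single colour, the tiny reference table, and its values are all assumptions for illustration, and a hand-written nearest-neighbour average stands in for the deployed model.

```python
import numpy as np


def mean_rgb(image_rgb):
    """Average colour of an H x W x 3 RGB array (an assumed reduction step)."""
    pixels = np.asarray(image_rgb, dtype=float).reshape(-1, 3)
    return tuple(float(c) for c in pixels.mean(axis=0))


def rgb_from_input(kind, payload):
    """Step 1: reduce any of the three input options to one (R, G, B) triple."""
    if kind in ("upload", "camera"):  # both options deliver an image array
        return mean_rgb(payload)
    if kind == "rgb":  # the user typed the colour code directly
        r, g, b = payload
        return (float(r), float(g), float(b))
    raise ValueError(f"unknown input kind: {kind}")


def knn_predict(train_rgb, train_ph, query_rgb, k=5):
    """Step 2: plain KNN regression, the mean pH of the k closest colours."""
    train_rgb = np.asarray(train_rgb, dtype=float)
    query = np.asarray(query_rgb, dtype=float)
    distances = np.linalg.norm(train_rgb - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(np.asarray(train_ph, dtype=float)[nearest]))


# Made-up reference table (NOT the paper's data), only to exercise the flow.
ref_rgb = [(230, 60, 50), (235, 140, 40), (220, 200, 60),
           (120, 180, 70), (40, 120, 110), (50, 60, 130)]
ref_ph = [1.0, 4.0, 6.0, 8.0, 10.0, 13.0]

# Step 3: show the result for a picture-like input and a typed colour code.
photo = np.full((20, 20, 3), (230, 60, 50), dtype=np.uint8)
for kind, payload in (("upload", photo), ("rgb", (50, 60, 130))):
    rgb = rgb_from_input(kind, payload)
    # k=1 only because the demo table is tiny; the text reports K = 5.
    print(kind, "-> estimated pH", knn_predict(ref_rgb, ref_ph, rgb, k=1))
```

Funnelling all three input options into one RGB triple keeps the prediction step identical regardless of how the user supplies the colour.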
Our approach has a significant advantage over the methods already in use; Fig. 8 shows a comparison of pH instruments, pH paper, and the current study.
Furthermore, Fig. 9 shows the estimated pH value (output results) of the proposed mobile application in comparison with the real pH value. The close agreement between real and estimated values over the whole pH range (acidic or basic) reflects the high accuracy of the ML model used.
Solmaz et al. [31] studied colorimetric detection with pH strips using ML, as presented in Table 2. In the present work, four different types of smartphones were used to check the accuracy of pH value predictions for three buffer solutions (pH = 3, 7, and 10). The default settings were used to avoid any smartphone-specific effects. As shown in Fig. 10 and Table 3, the various smartphones give no significantly different pH value estimates, with an accuracy of more than 90% for each type.
Furthermore, Table 4 shows recommended conditions and limitations for using the application to achieve more accurate predictions.
Overall, the present findings solve the problem of pH accuracy using common pH paper without the need for additional costly and time-consuming experimental work. In addition, our approach avoids the excessive cost and maintenance required for traditional pH meters.

Conclusion
The findings demonstrate a strong negative association between the pH value and the red color (−0.77) and a weaker one with the green color (−0.38). The blue color, which showed a low correlation (0.044), has an insignificant impact on the machine learning prediction. The KNN model achieves an R² of 0.993 along with the lowest MSE, RMSE, and MAE errors (0.012, 0.320, and 0.182, respectively). This paper also demonstrated the potential of the ML approach to estimate the pH value of solutions using common pH paper. We developed a freely available application, supported on mobile devices, that predicts the pH value with precise results based on ML and common pH paper. Future research should consider the preparation of new optical materials with extremely sensitive color changes in strongly acidic or basic media.